Abstract

This  paper  provides  the  customized  MVA  equations  for  an  analytical  model  for  evaluating  architectural  alternatives  for 
shared-memory  multiprocessors  with  processors  that  aggressively  exploit  instruction-level  parallelism  (ILP).  Compared  to 
simulation,  the  analytical  model  is  many  orders  of  magnitude  faster  to  solve,  yielding  highly  accurate  system  performance 
estimates  in  seconds. 


1  Introduction 

In  [8],  we  presented  an  analytical  model  for  evaluating  specific  types  of  architectural  trade-offs  for  shared-memory  systems 
with  ILP  processors.  As  shown  in  that  paper,  the  analytical  model  validates  extremely  well  against  detailed  simulation  and 
produces  results  in  a  few  seconds. 

The  principal  aspects  of  the  model  are: 

•  The  ILP  processor  and  its  associated  two-level  cache  system  are  viewed  as  a  black  box  that  generates  requests  to  the 
memory  system  and  intermittently  blocks  after  a  dynamically  changing  number  of  requests. 

•  We  iterate  between  two  submodels;  one  represents  the  blocking  behavior  due  to  load  misses  that  cannot  be  retired  until 
the  data  returns  from  memory,  and  the  other  submodel  represents  the  blocking  behavior  due  to  the  hardware  constraint 
on  the  total  number  of  outstanding  memory  requests. 

•  In  each  submodel,  the  memory  system  is  viewed  as  a  system  of  queues  (e.g.,  the  memory  bus,  DRAM  modules 
and  associated  directories,  and  network  interfaces)  and  delay  centers  (e.g.,  switches  in  the  interconnection  network). 
We  create  a  set  of  intuitive  customized  mean  value  analysis  (CMVA)  equations  to  obtain  estimates  of  throughput 
(instructions  per  cycle)  in  each  submodel.  The  CMVA  technique  has  proven  to  be  accurate  in  validation  experiments 
for a number of simpler architectural models [9].

The  purpose  of  this  technical  report  is  to  provide  the  details  of  the  customized  MVA  equations  which  were  omitted  in  [8] 
due  to  space  constraints.  Section  2  of  this  report  provides  the  model  input  parameters.  Section  3  provides  an  overview  of 
the  analytical  model,  and  Section  4  presents  the  customized  MVA  equations.  Further  discussion  of  the  model,  including 
validation and applications, can be found in [8].
HRD-9896132, MIP-9625558, CDA-9623632, CCR-9410457, CCR-9502500, CDA-9502791, and CDA-9617383. Sarita V. Adve is also supported in
part  by  an  IBM  University  Partnership  award  and  by  the  Texas  Advanced  Technology  Program  under  Grant  No.  003604-025.  Vijay  S.  Pai  is  supported  by  a 
Parameter          Description
$N$                number of nodes
$m$                memory modules per node
$M_{hw}$           number of MSHRs
$S_{NIout,r}$      NI send occupancy for request
$S_{NIout,d}$      NI send occupancy for data
$S_{NIin,r}$       NI receive occupancy for request
$S_{NIin,d}$       NI receive occupancy for data
$S_{bus,r}$        bus occupancy for request
$S_{bus,d}$        bus occupancy for data
$S_{mem}$          memory/directory (DRAM) access
$S_{tag}$          L2 tag check
$S_{switch}$       per-header network switch occupancy

Table 1. System Architecture Parameters


Parameter                             Description
$\tau$                                Average time between read, write, or upgrade requests to memory,
                                      not counting the time when the processor is completely stalled
                                      or is spin-waiting on a synchronization event
$CV_\tau$                             Coefficient of variation of $\tau$
$f_{synch-write}$                     Fraction of write requests that are generated by atomic
                                      read-modify-write instructions or that coalesce with at least
                                      one later read
$f_M$                                 Fraction of processor stalls that find $M$ MSHRs with
                                      outstanding read requests
$P_{read}$, $P_{write}$, $P_{upgrade}$  Probability that a memory request is a read, write, or upgrade
$P_{wb}$                              Probability that a read or write request causes a writeback of
                                      a cache block
$P_{L|x}$                             Probability directory is local for a type $x$ transaction;
                                      $x$ = read, write, upgrade, writeback
$P_{M|x,y}$                           Probability home memory can supply the data for a type $x,y$
                                      request; $x$ = read, write; $y$ = local home, remote home
$P_{3hop|x \& not-memory}$            Probability that a request of type $x$ to a remote home is
                                      forwarded to a cache at a third node; $x$ = read, write
$H$                                   Average number of network switches traversed by a packet
$X$                                   Average number of invalidates caused by a write or upgrade to
                                      a clean line

Table 2. Application Parameters

2  System  Architecture  and  Model  Parameters 

2.1  System  Architecture 

The architecture modeled is a cache-coherent, release-consistent shared-memory multiprocessor system in which the
processing nodes are connected by a mesh interconnection network [8].

2.2  Model  Parameters 

Model  parameters  can  be  classified  as  either  describing  the  system  or  describing  the  application.  Table  1  defines  the 
system  parameters,  while  Table  2  summarizes  the  application  parameters.  From  the  parameters  in  Table  2,  we  can  compute 
the  probabilities  of  the  protocol  transactions  in  Table  3. 

The  first  four  parameters  in  Table  2  characterize  the  ability  of  the  processor  to  overlap  multiple  memory  requests  while 
running  a  given  compiled  application  (or  set  of  applications).  These  parameters,  referred  to  as  ILP  parameters,  are  discussed 
in  more  detail  below.  The  other  parameters  in  the  table  are  standard  parameters  for  models  of  architectures  based  on  directory 
coherence protocols [1]. Note that the parameters are defined for homogeneous applications; that is, each processor has the
same value for each parameter in the table, and memory requests are assumed to be equally distributed across the relevant
memory modules (local or remote) due to interleaving and effective data layout. There is a natural extension of these
parameters for non-homogeneous applications, but for simplicity in the model exposition we use the given parameters.

The parameter $\tau$ is the average time between requests generated by the processor to the (main) memory subsystem, not
including the time that the processor is stalled or is spin-waiting on a synchronization event such as a lock release, flag, or
barrier completion. We also measure the coefficient of variation of $\tau$, $CV_\tau$. $\tau$ is well-defined for simple processors that block
on each load and store, whereas the notion that a complex modern processor is stalled has several possible definitions. For
the robust parameter $\tau$ that is needed for the model, the processor is defined to be stalled when it is completely stalled; that is,
the functional units are completely idle, no further instructions can be retired or issued until data returns from memory, and
all outstanding cache requests are waiting for data from main memory. The fraction of time a processor is completely stalled
is one of the performance metrics estimated by the analytic model. The parameter $\tau$ does not include this time.

The $f_{synch-write}$ parameter is the fraction of write requests that are synchronous; that is, they are generated by
read-modify-write requests or they coalesce with at least one later read miss. Read misses that coalesce with earlier read requests
are completely invisible to the model because they do not generate any memory system traffic and do not cause any
new blocking behavior. Thus, a parameter for the frequency of read-read coalescing is not needed. The same holds for writes that
coalesce with previous misses.

The set of parameters $f_M$, $M \ge 1$, are the fractions of processor stall events that have $M$ MSHRs occupied with read
misses. These fractions are defined and measured for a system with a number of MSHRs larger than the maximum value that
will be evaluated with the model. We will refer to such a system as an “infinite MSHR” system. Note that if a read miss
occurs for a line that has a prior write miss outstanding, then the miss is counted as a read miss when measuring $M$. Also
note that misspeculated reads are counted in $M$. The $f_M$ parameters are unique to a system with non-blocking loads.

We have verified that the application input parameters are relatively insensitive to changes in the memory system
architectural parameters that can be varied in the model (e.g., the number of MSHRs, the speed of the bus and interconnection
network switches, and the main memory configuration). However, the application parameters are sensitive to various parameters
of the processor and cache architecture. For example, $\tau$, $CV_\tau$, $f_M$, and $f_{synch-write}$ are sensitive to the instruction window
size.

3  The  Analytic  Model 

The  principal  output  measure  computed  by  the  model  is  the  system  throughput,  measured  in  instructions  retired  per  cycle 
(IPC).  This  throughput  is  computed  as  a  function  of  the  input  parameters  that  characterize  the  workload  and  the  memory 
architecture.  The  customized  MVA  equations  defined  in  this  report  assume  that  the  directory  lookup  is  coupled  with  memory 
access,  so  a  single  service  time  applies  to  the  parallel  memory  and  directory  lookup. 

3.1  Model  Overview 

We  use  the  term  synchronous  for  read  requests  (and  for  read-modify-write  requests)  because  the  data  must  return  before  a 
load  (or  read-modify-write)  instruction  is  retired  from  the  instruction  window.  Other  requests  (writes,  upgrades,  writebacks, 
invalidates,  and  acknowledgments)  are  asynchronous.  Table  3  defines  all  of  the  memory  system  transactions. 

A key question in developing the analytic model is how to compute throughput as a function of the dynamically changing
number of outstanding memory requests that can be issued before the processor must stall waiting for data to return from
memory. We address this issue by iterating between the following two submodels for each value of $M$, $1 \le M < M_{hw}$:

•  the  synchronous  blocking  submodel  (SB)  that  computes  the  fraction  of  time  the  processor  is  stalled  due  to  load  or 
read-modify-write  instructions  that  cannot  be  retired  until  the  data  returns  from  memory, 

•  the  MSHR  blocking  submodel  (MB)  that  computes  the  additional  fraction  of  time  the  processor  is  stalled  purely  due  to 
the  MSHRs  being  full. 

For $M = M_{hw}$ we compute throughput from a modified version of the MSHR-blocking submodel alone, as explained below.
Once these throughputs are computed, we compute the weighted sum of the throughputs, weighted by the frequency of each
throughput value that would be observed for the number of MSHRs in the system. This frequency can in turn be computed
from the model input parameters, $f_M$. The remainder of this section gives the most pertinent details of the two submodels
as well as how slowdown due to synchronization delays is computed; the full set of equations for the submodels is given in
Section 4.
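The weighting step can be illustrated with a minimal Python sketch. It rests on an assumption that the report does not spell out at this point: stall events measured with $M \ge M_{hw}$ outstanding reads on the “infinite MSHR” system are all attributed to the $M = M_{hw}$ case. The function names and the dictionary-based interface are hypothetical.

```python
def mshr_weights(f, m_hw):
    """Map measured fractions f[M] onto the values M = 1..m_hw.

    Assumption (not taken from the report): every stall event measured with
    M >= m_hw reads on the "infinite MSHR" system is lumped into M = m_hw.
    """
    weights = {m: f.get(m, 0.0) for m in range(1, m_hw)}
    weights[m_hw] = sum(frac for m, frac in f.items() if m >= m_hw)
    return weights


def weighted_throughput(f, ipc_by_m, m_hw):
    """Weighted sum of the per-M throughput estimates (IPC)."""
    weights = mshr_weights(f, m_hw)
    return sum(weights[m] * ipc_by_m[m] for m in weights)


if __name__ == "__main__":
    f = {1: 0.5, 2: 0.3, 3: 0.2}      # illustrative f_M values
    ipc = {1: 1.0, 2: 2.0}            # illustrative per-M throughputs
    print(weighted_throughput(f, ipc, m_hw=2))
```

With two hardware MSHRs, the fractions measured for $M = 2$ and $M = 3$ are combined under this assumption, so both throughput values receive weight 0.5 in the example.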
Request type   Home     Data supplied by     Transaction name
Read           local    memory               LC
Read           local    remote cache         LHCC
Read           remote   memory               RC
Read           remote   cache at home        RHLCC
Read           remote   cache at non-home    RHRCC
Upgrade        local    -                    LUPG
Upgrade        remote   -                    RUPG
Write          local    memory               LCINV
Write          local    remote cache         LHCCwr
Write          remote   memory               RCINV
Write          remote   cache at home        RHLCCwr
Write          remote   cache at non-home    RHRCCwr
Writeback      local    -                    LWB
Writeback      remote   -                    RWB

Table 3. Protocol Transactions


3.2  The  Two  Submodels 

Each of the two submodels (SB and MB) contains similar sets of customized MVA equations to compute the response time
for a transaction in the memory subsystem (see Section 4). The only differences between the submodels are in the equations
for the overall residence time, $R$, and the processor residence time, $R_{pe}$. We discuss these differences in this section and then
discuss the CMVA equations that are common to both submodels in Section 4.

$R$ consists of the residence times at the processor, network, network interface (NI), bus (both local and remote), and
memory. It also includes a term, $Z$, which represents the latencies at resources with negligible contention (e.g., cache
tag check). The difference in the equations for $R$, besides the difference in $R_{pe}$, is that $R_{SB}$ is the sum of the residence
times of the synchronous transactions, whereas $R_{MB}$ is the sum of the residence times of the asynchronous transactions plus
the residence times of the reads that are synchronous in the MB submodel (this will be discussed below). These equations are:

$$R_{SB} = R_{pe,SB} + R^{synch}_{NET} + R^{synch}_{NI} + R^{synch}_{bus} + R^{synch}_{mem} + Z^{synch}$$

$$R_{MB} = R_{pe,MB} + R^{asynch}_{NET} + R^{asynch}_{NI} + R^{asynch}_{bus} + R^{asynch}_{mem} + Z^{asynch} + P_{synch-read-MB}\left(R^{synch}_{NET} + R^{synch}_{NI} + R^{synch}_{bus} + R^{synch}_{mem} + Z^{synch}\right)$$
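As an illustration of how the two overall residence times are assembled from their components, the following Python sketch adds up the per-resource terms. The `Residence` container and the function names are hypothetical; the numeric values are placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass
class Residence:
    """Per-resource residence times for one class (synchronous or asynchronous)."""
    net: float
    ni: float
    bus: float
    mem: float
    z: float  # latencies at resources with negligible contention

    def total(self) -> float:
        return self.net + self.ni + self.bus + self.mem + self.z


def residence_sb(r_pe_sb: float, synch: Residence) -> float:
    """R_SB: processor residence time plus all synchronous terms."""
    return r_pe_sb + synch.total()


def residence_mb(r_pe_mb: float, asynch: Residence, synch: Residence,
                 p_synch_read_mb: float) -> float:
    """R_MB: processor plus asynchronous terms, plus the synchronous terms
    scaled by the probability that a read is synchronous in the MB submodel."""
    return r_pe_mb + asynch.total() + p_synch_read_mb * synch.total()


if __name__ == "__main__":
    synch = Residence(net=4.0, ni=3.0, bus=2.0, mem=10.0, z=1.0)
    asynch = Residence(net=2.0, ni=1.0, bus=1.0, mem=5.0, z=1.0)
    print(residence_sb(12.0, synch), residence_mb(12.0, asynch, synch, 0.25))
```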

In the SB submodel, the number of customers per processor is equal to the maximum number of read requests that can
be issued before the processor blocks (i.e., one of the observed values of $M$). The processor (and its associated cache
subsystem) is an FCFS queue that initially has mean service time equal to $\tau$. Note that this queue is idle only when $M$
memory read requests are outstanding; otherwise it is generating memory requests at rate $1/\tau$. If the request is a write miss,
the customer is routed immediately back to the processor while simultaneously forking an asynchronous memory write or
upgrade transaction, using a technique similar to that proposed by Heidelberger and Trivedi [5]. Thus, the $R^{synch}_x$ and $Z^{synch}$
terms are non-zero only for read requests (see Section 4).

In the MB submodel, the number of customers per processor is equal to the number of MSHRs, $M_{hw}$. MSHRs can be
occupied by read, write, or upgrade requests; however, for architectures with non-blocking stores and in-order retirement of
loads, and for $M < M_{hw}$, the blocking time when the MSHRs contain $M$ read requests is accounted for in the SB submodel. The
additional blocking time that needs to be computed by the MB model is for the case that the MSHRs contain $M_{hw}$ requests
of which fewer than $M$ are read requests. That is, we can measure $M$ read requests for the “infinite MSHR” system, but the
system with $M_{hw}$ MSHRs could block with fewer than $M$ reads in the MSHRs because some registers are filled with other
requests. All writes and upgrades, plus some read requests, must be synchronous in the MB submodel to account for this
additional blocking behavior.
The following four equations account for the read requests that should be synchronous in the MB submodel¹. The
first equation estimates $P_j$, the probability that only $j$ of the $M$ reads that were measured in the “infinite MSHR”
system, $1 \le j \le M$, are in the first $M_{hw}$ MSHRs. The second equation estimates the utilization of the processor in the SB
submodel, and this utilization is used in the third equation to compute $U'_{pe}$, which is an estimate of the probability that a
customer leaving the processor in the SB submodel is leaving behind a non-empty processor queue. The third equation has
a term, $U_{SB}$, which is the probability that a processor in the SB submodel is not stalled, and it will be explained below. By
multiplying $P_j$ by $U'_{pe}$ and summing over $j$, we obtain the probability that a read should be considered synchronous in the
MB submodel, as shown in the last equation.


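$$P_j = \mathrm{Prob}[\,j \text{ reads in MSHRs} \mid \text{at least 1 read in MSHRs}\,] = [\ldots]\,(P_{write} + P_{upgrade})^{(M_{hw}-j)}\,P_{read}^{\,j}\,[\ldots]$$

$$U_{pe} = \frac{M\tau}{R_{SB}}$$

$$U'_{pe} = \frac{U_{pe}}{U_{pe} + 1 - U_{SB}}$$

$$P_{synch-read-MB} = \mathrm{Prob}[\text{read is synchronous in MB submodel}] = \sum_j P_j\,U'_{pe}$$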


The  read  misses  that  are  not  synchronous  in  the  MB  submodel  are  immediately  routed  back  to  the  processor  (since  the 
processor  cannot  stall  on  these  read  misses  in  this  submodel)  while  simultaneously  forking  a  read  transaction  to  the  memory 
system, again using a technique similar to that in [5].

As mentioned above, the other difference in the equations between the two submodels concerns $R_{pe}$. This difference arises
from how we represent the processor stall time that is estimated by one submodel in the other submodel. That is, the mean
time that each customer occupies the processor in the MB submodel is equal to $\tau_{MB}$, where $\tau_{MB}$ is $\tau$ adjusted to reflect the
fraction of time that the processor is stalled due to load or read-modify-write instructions that cannot be retired (computed
from the SB model). That is, $\tau_{MB} = \tau / U_{SB}$. Once the measures are computed from the MB model, the SB model is solved
again using $\tau_{SB} = \tau / U_{MB}$, where $U_{MB}$ is the fraction of time that the processor is not stalled in the MB submodel.


$$U_{SB} = \frac{M\tau}{R_{SB}\,U_{MB}}$$

$$U_{MB} = \frac{M\tau}{R_{MB}\,U_{SB}}$$


The alternating solution of each submodel is repeated until the estimated throughputs converge. This approach might be
named the “method of surrogate service time inflation,” analogous to the method of surrogate delays [4, 6].
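The alternating solution can be sketched in Python as a fixed-point iteration. This is only an illustration under loud assumptions: `toy_residence` is a hypothetical stand-in for the full CMVA solution of a submodel (it is not the report's residence-time computation), and convergence is tested on the two utilizations rather than on throughput. Only the inflation rules $\tau_{SB} = \tau/U_{MB}$, $\tau_{MB} = \tau/U_{SB}$ and the two utilization formulas follow the text.

```python
def toy_residence(tau_eff, m, memory_delay):
    """Hypothetical stand-in for solving a submodel: processor time for m
    customers plus a fixed memory-system delay (NOT the CMVA equations)."""
    return m * tau_eff + memory_delay


def solve_submodels(tau, m, delay_sb, delay_mb, tol=1e-10, max_iter=1000):
    """Alternate between the SB and MB submodels until U_SB and U_MB settle."""
    u_sb, u_mb = 1.0, 1.0  # start with no stall time in either submodel
    for _ in range(max_iter):
        # SB submodel: service time inflated by MB stall time.
        tau_sb = tau / u_mb
        r_sb = toy_residence(tau_sb, m, delay_sb)
        new_u_sb = m * tau / (r_sb * u_mb)
        # MB submodel: service time inflated by SB stall time.
        tau_mb = tau / new_u_sb
        r_mb = toy_residence(tau_mb, m, delay_mb)
        new_u_mb = m * tau / (r_mb * new_u_sb)
        converged = abs(new_u_sb - u_sb) < tol and abs(new_u_mb - u_mb) < tol
        u_sb, u_mb = new_u_sb, new_u_mb
        if converged:
            return u_sb, u_mb
    raise RuntimeError("submodel iteration did not converge")


if __name__ == "__main__":
    print(solve_submodels(tau=10.0, m=2, delay_sb=30.0, delay_mb=15.0))
```

Replacing `toy_residence` with a routine that evaluates the Section 4 equations would give the structure described above; the loop itself does not depend on how each submodel is solved.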

The equations for $R_{pe,MB}$ and $R_{pe,SB}$ are shown below. The processor residence time consists of the customer’s service
time and the amount of time that the customer waits for the $M - 1$ other customers that might be either waiting or in service
at the same processor. In each equation, the difference inside the first parentheses is the estimated fraction of the other
customers who are waiting, and the quantity that it subtracts, which also multiplies the residual-life term, is the estimated
fraction in service.


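$$R_{pe,MB} = \tau_{MB}\left[1 + (M-1)\left([\ldots] - [\ldots]\right)\right] + (M-1)\left([\ldots]\right)\tau_{MB,residual}$$

$$R_{pe,SB} = \tau_{SB}\left[1 + (M-1)\left([\ldots] - [\ldots]\right)\right] + (M-1)\left([\ldots]\right)\tau_{SB,residual}$$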

$\tau_{SB,residual}$ and $\tau_{MB,residual}$ represent the residual life of the customer being served at the processor when a new
customer arrives. As explained in [8], the standard equation for residual life under the assumption of Poisson arrivals is not
accurate since arrivals at the processor are not at random points in time;² therefore, we use an interpolation suggested by
Derek Eager [3].
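For reference, the standard residual-life value that the interpolation departs from can be computed directly from $\tau$ and $CV_\tau$. The Python sketch below shows only this random-arrival baseline (second moment divided by twice the mean); it is not the interpolation of [3], whose form is not given in this report. The function name is hypothetical.

```python
def residual_life_random_arrivals(tau, cv):
    """Mean residual life seen by a random arrival: E[S^2] / (2 E[S]).

    With mean tau and coefficient of variation cv, E[S^2] = tau^2 (1 + cv^2),
    so the result is tau (1 + cv^2) / 2. Baseline only, not the model's
    interpolated value.
    """
    second_moment = tau ** 2 * (1.0 + cv ** 2)
    return second_moment / (2.0 * tau)


if __name__ == "__main__":
    print(residual_life_random_arrivals(10.0, 0.0))  # deterministic service
    print(residual_life_random_arrivals(10.0, 1.0))  # exponential service
```

A deterministic service time ($CV_\tau = 0$) gives half the service time, and an exponential one ($CV_\tau = 1$) gives the full mean.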

For the case that $M = M_{hw}$, all processor stalls can be attributed to full MSHRs. In this case, we solve a modified MB
model in which there are $M_{hw}$ customers per processor and these customers represent the behavior of all read, write, and
upgrade memory system transactions. For any of these memory requests, the customer leaves the processor and visits the
appropriate memory system resources.

¹ The reads in the MB submodel have only a small effect on estimated throughput (less than 4% reduction in throughput for all applications validated in
[8], except FFTopt, which has a 10% reduction), and they are not discussed in [8].

² The estimated mean residual life for random arrivals equals the second moment of service time divided by $2\tau$ [7].

Once  throughput  is  computed  from  the  weighted  average  of  the  value  at  each  M,  synchronization  effects  are  accounted 
for  as  described  in  [8]. 

4  The  Customized  MVA  Equations 

As explained above, the SB and MB submodels use a set of customized MVA (CMVA) equations to compute the mean
delay for each type of transaction at the local and remote memory buses, local and remote directories (and associated
memory modules), and network interfaces. Fixed delays are assumed at resources that have negligible contention (e.g., cache
tag checks, coherence packet generation), and for the approximate delay at each network switch (observed across several
applications).

The  equations  listed  in  this  section,  along  with  the  equations  for  R  and  Rpe  that  were  presented  in  the  previous  section, 
completely  define  both  submodels.  The  equations  in  this  section  are  the  same  for  both  submodels,  and  they  can  easily  be 
modified  to  model  different  memory  system  architectures,  as  was  done  in  [8]. 

To make the equations more readable, we have adopted subscripts and superscripts to denote the possible variations in
the term to which they are attached. The resource is always the first subscript on a term, whether it is residence time ($R$),
waiting time ($W$), utilization ($U$), or service time ($S$). For example, $R_{NI}$ is the residence time at the network interface. For
the NI, there can be an additional keyword ($q$) appended to the NI to indicate that this term can vary depending on whether
the action is at the output queue ($q = out$) or the input queue ($q = in$) of the NI. For many terms, there is a subscript of $loc$
or $rem$ to indicate whether the action is at the local node or a remote node. The variable $y$ denotes the transaction type (see
Table 3). The variable $x$ denotes the type of message (request or data) that is on the bus or at the NI. Lastly, a superscript
variable $z$ denotes either synchronous or asynchronous.

For example, $U^{z}_{NIq_{loc},y,x}$ is the utilization of queue $q$ (out or in) at the local NI by a transaction of type $y$. The $x$ and $z$
denote whether the packet is a request packet or a data packet and whether the transaction is synchronous or asynchronous. It
is important to note that a synchronous transaction can have an asynchronous part (e.g., an acknowledgment message to the
home node that occurs in a 3-hop cache-to-cache read request). For example, $R^{asynch}_{NIin_{rem},RHLCC,x}$ refers to the response time
at the input queue of the remote NI for the asynchronous request or data message from the synchronous transaction RHLCC.


Latencies  at  Resources  with  Negligible  Contention 

$$Z^{synch} = (P_{LC} + P_{RC})(S_{tag}) + (P_{LHCC} + P_{RHLCC} + P_{RHRCC})(S_{tag} + S_{coherence})$$

$$Z^{asynch} = (P_{LCINV} + P_{RCINV} + P_{LUPG} + P_{RUPG} + P_{LWB} + P_{RWB})(S_{tag}) + (P_{LHCCwr} + P_{RHLCCwr} + P_{RHRCCwr})(S_{tag} + S_{coherence})$$
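The fixed-latency terms are simple probability-weighted sums. The Python sketch below groups the transactions as in the two equations for $Z^{synch}$ and $Z^{asynch}$: tag check only, or tag check plus coherence packet generation. The function name, the dictionary interface, and the numbers are illustrative assumptions.

```python
SYNCH_TAG_ONLY = ("LC", "RC")
SYNCH_TAG_AND_COHERENCE = ("LHCC", "RHLCC", "RHRCC")
ASYNCH_TAG_ONLY = ("LCINV", "RCINV", "LUPG", "RUPG", "LWB", "RWB")
ASYNCH_TAG_AND_COHERENCE = ("LHCCwr", "RHLCCwr", "RHRCCwr")


def fixed_latencies(p, s_tag, s_coherence):
    """Return (Z_synch, Z_asynch) given transaction probabilities p[name]."""
    def mass(names):
        # Total probability of the listed transaction types (missing -> 0).
        return sum(p.get(name, 0.0) for name in names)

    z_synch = (mass(SYNCH_TAG_ONLY) * s_tag
               + mass(SYNCH_TAG_AND_COHERENCE) * (s_tag + s_coherence))
    z_asynch = (mass(ASYNCH_TAG_ONLY) * s_tag
                + mass(ASYNCH_TAG_AND_COHERENCE) * (s_tag + s_coherence))
    return z_synch, z_asynch


if __name__ == "__main__":
    probabilities = {"LC": 0.5, "LHCC": 0.1, "LCINV": 0.2, "LHCCwr": 0.2}
    print(fixed_latencies(probabilities, s_tag=3.0, s_coherence=4.0))
```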


Network 

Note  that  Sswitch  is  the  measured  average  per-switch  delay  in  the  network,  measured  across  several  applications  in  a  given 
class  of  applications.  Sswitch  could  also  be  estimated  by  a  more  detailed  MVA  model  of  the  interconnection  network. 

$$S_{NET} = H\,S_{switch}$$

$$R^{synch}_{NET} = \sum_y R^{synch}_{NET,y}$$

$$R^{asynch}_{NET} = \sum_y R^{asynch}_{NET,y}$$

$$R^{synch}_{NET,y} = \begin{cases} P_y\,2\,S_{NET} & y = RC, LHCC, RHLCC \\ P_y\,3\,S_{NET} & y = RHRCC \end{cases}$$

$$R^{asynch}_{NET,y} = \begin{cases} P_y\,S_{NET} & y = RWB \\ P_y\,2\,S_{NET} & y = RCINV, LHCCwr, RHLCCwr, LUPG \\ P_y\,3\,S_{NET} & y = RHRCCwr \\ P_y\,4\,S_{NET} & y = RUPG \end{cases}$$
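Because the network is modeled as a delay center, its residence time is a weighted count of network traversals. The Python sketch below encodes the traversal counts per transaction type as read from the network equations (writebacks are treated as asynchronous, as stated in Section 3.1); the table and function names are hypothetical.

```python
# Number of network traversals charged to each transaction type.
SYNCH_TRAVERSALS = {"RC": 2, "LHCC": 2, "RHLCC": 2, "RHRCC": 3}
ASYNCH_TRAVERSALS = {"RWB": 1, "RCINV": 2, "LHCCwr": 2, "RHLCCwr": 2,
                     "LUPG": 2, "RHRCCwr": 3, "RUPG": 4}


def network_residence(p, h, s_switch, traversals):
    """Sum over y of P_y * (traversals for y) * S_NET, with S_NET = H * S_switch."""
    s_net = h * s_switch
    return sum(p.get(y, 0.0) * count * s_net for y, count in traversals.items())


if __name__ == "__main__":
    probabilities = {"RC": 0.5, "RHRCC": 0.1, "RWB": 0.2}
    print(network_residence(probabilities, h=2.0, s_switch=5.0,
                            traversals=SYNCH_TRAVERSALS))
    print(network_residence(probabilities, h=2.0, s_switch=5.0,
                            traversals=ASYNCH_TRAVERSALS))
```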

Network  Interface 


Below are the visit count equations for the NI. For example, $V^{synch}_{NIout_{loc},y,r}$ is the visit count at the output queue of the local
NI of request messages associated with the synchronous part of a transaction of type $y$.


1  y=RC,LHCC,RHLCC,RHRCC 

1  y=RC,LHCC,RHLCC,RHRCC 

I  1  y=RCINV,RUPG,LHCCwr,RHLCCwr,RHRCCwr 
\  X  y=LCINV,LUPG 

1  y=RWB 

(  X  y=LCINV,LUPG 
1  y=RUPG 
[  1  y=LHCCwr 

1  y=RCINV,LHCC,LHCCwr,RHLCCwr,RHRCCwr 

=  (wb)  y=Rhrcc 

:  y=RC,LHCC,RHLCC,RHRCC 

y=RC,LHCC,RHLCC,RHRCC 

(ttt)  y=RHRCC 

'  (X-)  y=LCINV,LUPG 
(tFi)  y=RCiNV 
(Wr)  y=RUPG 
(j±j)  y=LHCCwr,RHLCCwr 
,  C^t)  y=RHRCCwr 

:  (^F_)  y=RCINV,LHCC,RHLCC,RHRCC,LHCCwr,RHLCCwr,RHRCCwr 

'  (XL-)  y=LCINV,LUPG 
(^r)  y=RCINV,RUPG 
<  (jvb)  y=LHCCwr 
(j^t)  y=RHLCCwr 
.  fe)  y=RHRCCwr 

(jvb)  y=RWB,RHLCC,RHRCC 
The residence time at the NI is composed of the response times at the output queue and the input queue.

$$R^{synch}_{NI} = R^{synch}_{NIout} + R^{synch}_{NIin}$$

$$R^{asynch}_{NI} = R^{asynch}_{NIout} + R^{asynch}_{NIin}$$

The  residence  time  at  the  output  (input)  queue  is  the  sum  over  all  transaction  types  y  of  the  residence  times  at  the  local 
and  remote  output  (input)  queues.  For  each  transaction,  we  include  both  types  of  messages  (i.e.,  request  and  data)  that  are 
generated  by  the  synchronous  part  of  the  transaction. 


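$$R^{z}_{NIout} = \sum_y \left(R^{z}_{NIout_{loc},y,r} + R^{z}_{NIout_{loc},y,d}\right) + (N-1)\sum_y \left(R^{z}_{NIout_{rem},y,r} + R^{z}_{NIout_{rem},y,d}\right)$$

$$R^{z}_{NIin} = \sum_y \left(R^{z}_{NIin_{loc},y,r} + R^{z}_{NIin_{loc},y,d}\right) + (N-1)\sum_y \left(R^{z}_{NIin_{rem},y,r} + R^{z}_{NIin_{rem},y,d}\right)$$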

The next set of equations describes the residence times of the different types of messages at the input and output queues
of the NI. For example, $R^{synch}_{NIout_{loc},y,x}$ is the residence time at the output queue of the local NI of a message of type $x$ that
is associated with the synchronous part of a transaction of type $y$. This time equals the probability of transaction $y$ times
the visit count at this queue times the sum of the per-visit waiting time and the service time that this type $x$ message will
experience.

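$$R^{synch}_{NIout_{loc},y,x} = P_y V^{synch}_{NIout_{loc},y,x}\left(W_{NIout_{loc}} + S_{NIout,x}\right) \quad y = RC, LHCC, RHLCC, RHRCC$$

$$R^{synch}_{NIin_{loc},y,x} = P_y V^{synch}_{NIin_{loc},y,x}\left(W_{NIin_{loc}} + S_{NIin,x}\right) \quad y = RC, LHCC, RHLCC, RHRCC$$

$$R^{asynch}_{NIout_{loc},y,x} = P_y V^{asynch}_{NIout_{loc},y,x}\left(W_{NIout_{loc}} + S_{NIout,x}\right) \quad y = RWB, RCINV, RUPG, LCINV, LUPG, LHCCwr, RHLCCwr, RHRCCwr$$

$$R^{asynch}_{NIin_{loc},y,x} = P_y V^{asynch}_{NIin_{loc},y,x}\left(W_{NIin_{loc}} + S_{NIin,x}\right) \quad y = RCINV, RUPG, LHCC, LCINV, LUPG, LHCCwr, RHLCCwr, RHRCCwr$$

$$R^{synch}_{NIout_{rem},y,x} = P_y V^{synch}_{NIout_{rem},y,x}\left(W_{NIout_{rem}} + S_{NIout,x}\right) \quad y = RC, LHCC, RHLCC, RHRCC$$

$$R^{synch}_{NIin_{rem},y,x} = P_y V^{synch}_{NIin_{rem},y,x}\left(W_{NIin_{rem}} + S_{NIin,x}\right) \quad y = RC, LHCC, RHLCC, RHRCC$$

$$R^{asynch}_{NIout_{rem},y,x} = P_y V^{asynch}_{NIout_{rem},y,x}\left(W_{NIout_{rem}} + S_{NIout,x}\right) \quad y = LHCC, RHLCC, RHRCC, LHCCwr, RHLCCwr, RHRCCwr, LCINV, LUPG, RCINV, RUPG$$

$$R^{asynch}_{NIin_{rem},y,x} = P_y V^{asynch}_{NIin_{rem},y,x}\left(W_{NIin_{rem}} + S_{NIin,x}\right) \quad y = RWB, RHLCC, RHRCC, LHCCwr, RHLCCwr, RHRCCwr, LCINV, LUPG, RCINV, RUPG$$

The following equations describe the utilization of the queues of the NI. The notation is similar to that for the residence
times of the NI.

$$U_{NIq_{loc}} = \sum_y U^{synch}_{NIq_{loc},y,x} + \sum_y U^{asynch}_{NIq_{loc},y,x}$$

$$U_{NIq_{rem}} = \sum_y U^{synch}_{NIq_{rem},y,x} + \sum_y U^{asynch}_{NIq_{rem},y,x}$$
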

The utilization of resource $q$ of the local (remote) NI by messages of type $x$ during a transaction of type $y$ is equal to the
throughput of type $y$ transactions ($P_y / R$) multiplied by the average number of type $x$ messages per type $y$ transaction that visit
this resource ($V^{z}_{NIq_{loc},y,x}$ or $V^{z}_{NIq_{rem},y,x}$), multiplied by the service time of these messages. The $z$ distinguishes between synchronous
and asynchronous transactions.

$$U^{z}_{NIq_{loc},y,x} = \left(\frac{P_y}{R}\right) V^{z}_{NIq_{loc},y,x}\,S_{NIq,x}$$

$$U^{z}_{NIq_{rem},y,x} = \left(\frac{P_y}{R}\right) V^{z}_{NIq_{rem},y,x}\,S_{NIq,x}$$
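The per-message residence time and utilization at an NI queue share the same ingredients: the transaction probability, the visit count, and the service time. A short Python sketch of both products follows, with hypothetical function names and placeholder numbers.

```python
def ni_residence(p_y, visits, wait, service):
    """Residence time: P_y * V * (W + S) for one message type at one NI queue."""
    return p_y * visits * (wait + service)


def ni_utilization(p_y, r_total, visits, service):
    """Utilization: (P_y / R) * V * S, where R is the overall residence time."""
    return (p_y / r_total) * visits * service


if __name__ == "__main__":
    # Illustrative values: 25% of transactions, 2 visits, wait 3, service 5.
    print(ni_residence(0.25, 2.0, 3.0, 5.0))
    print(ni_utilization(0.25, 100.0, 2.0, 5.0))
```

Dividing the residence time by $R$ gives the mean number of such messages present at the queue (Little's law), which is always at least the utilization because the waiting time is non-negative.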

The next equations are used to calculate the waiting time of a customer who arrives at one of the NI's queues. $W_{NIq_{loc}}$
is the waiting time at queue $q$ of the local NI. It is the sum of the waiting time due to locally generated messages ($W^{loc}_{NIq_{loc}}$)
and remotely generated messages ($W^{rem}_{NIq_{loc}}$). The subscript is the resource where the waiting occurs, and the superscript is
for the messages that cause the waiting. For $W_{NIq_{rem}}$, the waiting time at queue $q$ of the remote NI is due to “others” (other
processors: the processor that generated the arriving request and all other processors except the one that is local to the given
remote NI) and “remote” (the processor at the given remote NI).

$$W_{NIq_{loc}} = W^{loc}_{NIq_{loc}} + W^{rem}_{NIq_{loc}}$$

$$W_{NIq_{rem}} = W^{others}_{NIq_{rem}} + W^{remote}_{NIq_{rem}}$$

$W^{loc}_{NIq_{loc}}$ is the waiting time at queue $q$ of the local NI due to locally generated messages. It is composed of the waiting
times at this queue due to locally generated messages from the synchronous part and the asynchronous part of all transactions.

$$W^{loc}_{NIq_{loc}} = \sum_y \left(W^{loc,y,S}_{NIq_{loc}} + W^{loc,y,A}_{NIq_{loc}}\right)$$

$$W^{rem}_{NIq_{loc}} = \sum_y \left(W^{rem,y,S}_{NIq_{loc}} + W^{rem,y,A}_{NIq_{loc}}\right)$$

$$W^{others}_{NIq_{rem}} = \sum_y \left(W^{others,y,S}_{NIq_{rem}} + W^{others,y,A}_{NIq_{rem}}\right)$$

$$W^{remote}_{NIq_{rem}} = \sum_y \left(W^{remote,y,S}_{NIq_{rem}} + W^{remote,y,A}_{NIq_{rem}}\right)$$


The next equation is used to calculate the waiting time at queue $q$ of the local NI due to the synchronous part of the
transactions of type $y$ generated by the local processor. Breaking the equation down, we note that $\left(R^{synch}_{NIq_{loc},y,x}/R - U^{synch}_{NIq_{loc},y,x}\right)$
is the average number of customers in the queue minus the number of customers being served (i.e., the utilization, a number
between zero and one). We have to wait for all of these customers to complete, so this queue length is multiplied by the
service time. For the customer in service, we wait for its residual life in service, which is equal to half the service time for
a deterministic service time. Without the summation and factor of $(M - 1)$, we have the traffic due to only one customer’s
messages of type $x$. Therefore, we sum over the message types, and we multiply by a factor of $(M - 1)$ because an arriving
customer waits for the traffic of the $(M - 1)$ other customers of the processor local to the NI. The other waiting time equations
are similar to the one described.

$$W^{loc,y,S}_{NIq_{loc}} = \sum_x (M-1)\left[\left(\frac{R^{synch}_{NIq_{loc},y,x}}{R} - U^{synch}_{NIq_{loc},y,x}\right) S_{NIq,x} + \left(U^{synch}_{NIq_{loc},y,x}\right)\left(\frac{S_{NIq,x}}{2}\right)\right]$$
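The waiting-time computation described above can be written as a short Python function. It is a sketch under one stated assumption: the per-message residence time passed in already includes the transaction probability $P_y$ and the visit count, so that dividing it by the overall residence time $R$ gives a mean queue length. The function name and the tuple layout are hypothetical.

```python
def wait_local_synch(m, r_total, messages):
    """Waiting time at a local NI queue caused by the synchronous messages of
    one transaction type issued by the other (m - 1) local customers.

    messages: iterable of (residence, utilization, service), one per message
    type x. Assumption: residence already includes P_y and the visit count.
    """
    total = 0.0
    for residence, utilization, service in messages:
        queued = residence / r_total - utilization  # waiting, not in service
        total += (m - 1) * (queued * service + utilization * service / 2.0)
    return total


if __name__ == "__main__":
    # One message type: residence 10, utilization 0.04, service time 4.
    print(wait_local_synch(m=3, r_total=100.0, messages=[(10.0, 0.04, 4.0)]))
```

The first product charges a full service time for each queued message, and the second charges half a service time (the residual life of a deterministic service) for the message in service.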


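The next equation has a factor of $M$, instead of $M - 1$, since a given local customer can wait for the asynchronous requests
generated by all $M$ customers of the local processor (i.e., including the arriving customer). Note that $\frac{1}{R}P_y$ is the rate at which
an asynchronous transaction $y$ is forked. Multiplying this throughput by the residence time at a queue gives the mean queue
length for an asynchronous transaction. Thus, the form of these customized equations is very similar for synchronous and
asynchronous transactions.

$$W^{loc,y,A}_{NIq_{loc}} = \sum_x M\left[\left(\frac{R^{asynch}_{NIq_{loc},y,x}}{R} - U^{asynch}_{NIq_{loc},y,x}\right) S_{NIq,x} + \left(U^{asynch}_{NIq_{loc},y,x}\right)\left(\frac{S_{NIq,x}}{2}\right)\right]$$

$$W^{rem,y,z}_{NIq_{loc}} = \sum_x (N-1)\,[\ldots]\left[\left(\frac{R^{z}_{NIq_{rem},y,x}}{R} - U^{z}_{NIq_{rem},y,x}\right) S_{NIq,x} + \left(U^{z}_{NIq_{rem},y,x}\right)\left(\frac{S_{NIq,x}}{2}\right)\right]$$

$$W^{others,y,S}_{NIq_{rem}} = \sum_x \left[(M-1) + M(N-2)\right]\left[\left(\frac{R^{synch}_{NIq_{rem},y,x}}{R} - U^{synch}_{NIq_{rem},y,x}\right) S_{NIq,x} + \left(U^{synch}_{NIq_{rem},y,x}\right)\left(\frac{S_{NIq,x}}{2}\right)\right]$$

$$W^{others,y,A}_{NIq_{rem}} = \sum_x \left[M + M(N-2)\right]\left[\left(\frac{R^{asynch}_{NIq_{rem},y,x}}{R} - U^{asynch}_{NIq_{rem},y,x}\right) S_{NIq,x} + \left(U^{asynch}_{NIq_{rem},y,x}\right)\left(\frac{S_{NIq,x}}{2}\right)\right]$$

$$W^{remote,y,z}_{NIq_{rem}} = \sum_x [\ldots]$$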
Bus


The  notation  and  equations  for  the  bus  and  the  other  memory  subsystem  resources  are  very  similar  to  those  of  the  NI. 
These  equations  are  given  without  further  explanation. 


$$S^{synch}_{bus_{loc},y,r} = S_{bus,r} \quad y = LC, RC, LHCC, RHLCC, RHRCC$$

$$S^{synch}_{bus_{loc},y,d} = S_{bus,d} \quad y = LC, RC, LHCC, RHLCC, RHRCC$$

$$S^{asynch}_{bus_{loc},y,r} = S_{bus,r} \quad y = LHCC, RHLCC, RHRCC, LUPG, LCINV, RUPG, RCINV, LHCCwr, RHLCCwr, RHRCCwr$$

$$S^{asynch}_{bus_{loc},y,d} = S_{bus,d} \quad y = LHCC, RHLCC, RHRCC, LCINV, RCINV, LWB, RWB, LHCCwr, RHLCCwr, RHRCCwr$$

$$S^{synch}_{bus_{rem},y,r} = S_{bus,r} \quad y = RC, LHCC, RHLCC, RHRCC$$

$$S^{synch}_{bus_{rem},y,d} = S_{bus,d} \quad y = RC, LHCC, RHLCC, RHRCC$$

$$S^{asynch}_{bus_{rem},y,r} = S_{bus,r} \quad y = RHLCC, RHRCC, LCINV, LUPG, RCINV, RUPG, LHCCwr, RHLCCwr, RHRCCwr$$

$$S^{asynch}_{bus_{rem},y,d} = S_{bus,d} \quad y = RHLCC, RHRCC, RWB, RCINV, LHCCwr, RHLCCwr, RHRCCwr$$


■y  synch  _  f  1  y=LC,RC,RHLCC,RHRCC 

busloc,y,T  -  |  2  y=LHCC 

ysynch  =  x  y=LC,RC,LHCC,RHLCC,RHRCC 

bUSloc,y,d  J 


V, 


asynch 


busi0 


y asynch 

bUSlQQ^y  ,d 


V -synch 
bn  s 

u  LLo rem  ,y  ,v 

■rrsynch 
busrem  ,y  ,d 


'  1  y=RCINV,RHLCCwr,RHRCCwr 

2  y=RUPG 

<  3  y=LHCCwr 

2X  +  2  y=LUPG 
2X  +  1  y=LCINV 

{  1  y=LCINV,RCINV,LWB,RWB,LHCC,LHCCwr,RHLCCwr,RHRCCwr 

(  (jtt)  y=RC,LHCC 
{  fe)  y=RHLCC 
l  (jrr)  y=RHRCC 
:  (j^-)  y=RC,LHCC,RHLCC,RHRCC 


yasy 
v  busr 


=  < 


fe) 

(t&) 

(4X+2\ 

V  jV-1  > 

V  JV-l  > 


yasy 
v  bus-. 


rrsynch  _ 

nbus 


y=LHCCwr 
y=RHLCCwr 
y=RHRCCwr 
y=RCINV 
y=RUPG 
y=LCINV,LUPG 

^3-)  y=RWB,RCINV,LHCC,LHCCwr,RHLCCwr,RHRCCwr 
)  y=RHLCC,RHRCC 

$$R^{synch}_{bus} = \sum_y \left(R^{synch}_{bus_{loc},y,r} + R^{synch}_{bus_{loc},y,d}\right) + (N-1)\sum_y \left(R^{synch}_{bus_{rem},y,r} + R^{synch}_{bus_{rem},y,d}\right)$$


1 nch 
em  ,y  ,d 


{  (ivh 
l  (i^r 
$$R^{asynch}_{bus} = \sum_y \left(R^{asynch}_{bus_{loc},y,r} + R^{asynch}_{bus_{loc},y,d}\right) + (N-1)\sum_y \left(R^{asynch}_{bus_{rem},y,r} + R^{asynch}_{bus_{rem},y,d}\right)$$
$$R^{synch}_{bus_{loc},y,x} = P_y V^{synch}_{bus_{loc},y,x}\left(W_{bus_{loc}} + S^{synch}_{bus_{loc},y,x}\right) \quad y = LC, RC, LHCC, RHLCC, RHRCC$$

$$R^{asynch}_{bus_{loc},y,x} = P_y V^{asynch}_{bus_{loc},y,x}\left(W_{bus_{loc}} + S^{asynch}_{bus_{loc},y,x}\right) \quad y = LUPG, LCINV, LWB, RWB, RCINV, RUPG, LHCC, RHLCC, RHRCC, LHCCwr, RHLCCwr, RHRCCwr$$

$$R^{synch}_{bus_{rem},y,x} = P_y V^{synch}_{bus_{rem},y,x}\left(W_{bus_{rem}} + S^{synch}_{bus_{rem},y,x}\right) \quad y = RC, LHCC, RHLCC, RHRCC$$

$$R^{asynch}_{bus_{rem},y,x} = P_y V^{asynch}_{bus_{rem},y,x}\left(W_{bus_{rem}} + S^{asynch}_{bus_{rem},y,x}\right) \quad y = RWB, [\ldots]$$



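$$U_{bus_{loc}} = \sum_y U^{synch}_{bus_{loc},y,x} + \sum_y U^{asynch}_{bus_{loc},y,x}$$

$$U_{bus_{rem}} = \sum_y U^{synch}_{bus_{rem},y,x} + \sum_y U^{asynch}_{bus_{rem},y,x}$$

$$U^{z}_{bus_{loc},y,x} = \left(\frac{P_y}{R}\right) V^{z}_{bus_{loc},y,x}\,S^{z}_{bus_{loc},y,x}$$

$$U^{z}_{bus_{rem},y,x} = \left(\frac{P_y}{R}\right) V^{z}_{bus_{rem},y,x}\,S^{z}_{bus_{rem},y,x}$$
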
Memory

If  a  decoupled  directory  is  desired,  the  following  equations  can  also  be  used  for  the  directory,  with  the  appropriate  changes 
to  the  visit  counts. 